Method of optimising the execution of a neural network in a speech recognition system through conditionally skipping a variable number of frames

ABSTRACT

A method of optimizing the execution of a neural network in a speech recognition system provides for conditionally skipping a variable number of frames, depending on a distance computed between output probabilities, or likelihoods, of a neural network. The distance is initially evaluated between two frames at times t and t+k, where k is a predetermined maximum distance between frames, and if such distance is sufficiently small, the frames between times t and t+k are calculated by interpolation, avoiding further executions of the neural network. If, on the contrary, such distance is not small enough, it means that the outputs of the network are changing quickly, and it is not possible to skip too many frames. In that case, the method attempts to skip fewer frames, calculating and evaluating a new distance.

FIELD OF THE INVENTION

The present invention relates generally to speech recognition systems capable of recognizing spoken utterances, e.g. phrases, words or tokens, that are within a library developed by neural network-based learning techniques. More particularly, the invention concerns a method of speeding up the execution of neural networks for optimising the system performance, and a speech recognition system implementing such method.

BACKGROUND ART

An automatic speech recognition process can be schematically described by means of a plurality of modules, arranged sequentially between an input vocal signal and an output sequence of recognised words:

a first signal processing module, for digitising the incoming vocal signal; for example, for telephone speech, the sampling rate is 8000 samples per second; the vocal signal is transformed from analogue to digital and suitably sampled; the waveform is then divided into “frames”, where each frame is a small segment of speech that contains an equal number of waveform samples. In the following we assume a frame size of 10 msec, containing for example 80 samples (telephone speech);

a second feature extraction module, for computing features that represent the spectral-domain content of the vocal signal (regions of strong energy at particular frequencies); these features are computed every 10 msec, in correspondence with each frame;

a third module for pattern matching and temporal alignment; a Viterbi algorithm can be used for temporal alignment, for managing temporal distortions introduced by different speech speeds, while a neural network (also called an ANN, multi-layer perceptron, or MLP) can be used to classify a set of features into phonetic-based categories at each frame;

a fourth linguistic analysis module, for matching the neural-network output scores to the target words (the words that are assumed to be in the input speech), in order to determine the word that was most likely uttered.

In the above mentioned process the neural networks are used in the third module for acoustic pattern matching, i.e. for estimating the probability that a portion of a vocal signal belongs to a particular phonetic class, chosen in a set of predetermined classes, or constitutes a whole word in a predetermined set of words.

It is well known that the execution of a neural network, when it is carried out by emulation on a sequential processor, is very burdensome, especially in cases requiring networks with many thousands of weights. If the need arises to process, in real time, signals continuously varying through time, such as speech signals, the use of this technology becomes even more difficult.

A first attempt to solve such problem has been made in EP 0 733 982, wherein a method of speeding the execution of a neural network for correlated signal processing is disclosed. The method is based upon the principle that, since the input signal is sequential and evolves slowly and continuously through time, it is not necessary to compute again all the activation values of all neurons for each input, but rather it is enough to propagate through the network the differences with respect to the previous input. That is, the operation does not consider the absolute neuron activation values at time t, but the differences with respect to activation values at time t−1. Therefore at any point of the network, if a neuron has, at time t, an activation that is sufficiently similar to that of time t−1, that neuron does not propagate any signal, limiting the activity to only neurons having an appreciable change in the activation level. The method disclosed in EP 0 733 982 allows a saving, in terms of running time, of about ⅔ of the original running time.

A second method for reducing the load on a processor when running a speech recognition system is disclosed in document U.S. Pat. No. 6,253,178. Such method includes two steps, a first step of calculating feature parameters for a reduced set of frames of the input speech signal, decimated to select K frames out of L frames of the input speech signal according to a decimation rate K/L. The result of the first step is a first series of recognition hypotheses whose likelihood is subsequently re-calculated (re-scoring phase) by the second recognition step, which is more detailed and uses all the input frames. Although the execution of the first step reduces computing times, the second recognition step still requires a high processing load. Moreover, the two-step recognition technique (coarse step and detailed step) has a basic problem: if the first step misses a correct hypothesis, such hypothesis can no longer be recovered in the second step.

A further well known technique for speeding the execution of a speech recognition system provides for skipping one or more frames in those regions where the signal is stationary. Such technique is based, in the prior art, on measuring a cepstrum distance between features extracted from frames of the input signal, i.e. such distance is measured on the input parameters of the pattern matching module.

An example of such technique is disclosed in “Modeling and Efficient Decoding of Large Vocabulary Conversational Speech”, Michael Finke, Jurgen Fritsch, Detlef Koll, Alex Waibel, Eurospeech 1999 Budapest. In such document the recognition process, in particular the acoustic model evaluation, is sped up by a dynamic frame skipping technique. The frame skipping technique is based on the idea of re-evaluating acoustic models only if the acoustic vector has changed significantly from a time t to a time t+1. A threshold on the Euclidean distance is defined to trigger re-evaluation of the acoustics. To avoid skipping too many consecutive frames only one skip is allowed at a time, i.e. after skipping one frame the next one must be evaluated. Such method, based on the cepstrum distance between input parameters, is not accurate, as the distribution of the acoustic parameters is a “multimode” distribution, even in the same acoustic class. As a consequence, frames having a high cepstrum distance can actually belong to the same acoustic class. Moreover, such method does not allow more than one frame to be skipped at a time.
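Purely as an illustration of the prior-art technique summarised above, the single-skip rule can be sketched in Python as follows. This is only a reading of the description given here, not the implementation of the cited document; the names dynamic_frame_skip and evaluate and the sample data are hypothetical, and it is assumed that a skipped frame simply reuses the previous score.

```python
import math


def dynamic_frame_skip(features, evaluate, threshold):
    """Reuse the previous score when a frame is close to its predecessor.

    At most one frame is skipped in a row: the frame that follows a
    skipped frame is always evaluated.
    """
    scores = []
    evaluated = []            # indices of the frames actually evaluated
    skipped_previous = False
    for t, vec in enumerate(features):
        # Euclidean distance between the input vectors at t-1 and t
        close = t > 0 and math.dist(vec, features[t - 1]) < threshold
        if close and not skipped_previous:
            scores.append(scores[-1])      # skip: copy the last score (assumption)
            skipped_previous = True
        else:
            scores.append(evaluate(vec))   # re-evaluate the acoustic model
            evaluated.append(t)
            skipped_previous = False
    return scores, evaluated


# Hypothetical data: four identical vectors followed by a large change.
features = [[0.0, 0.0]] * 4 + [[5.0, 5.0]]
scores, evaluated = dynamic_frame_skip(features, evaluate=lambda v: sum(v), threshold=1.0)
print(evaluated)  # [0, 2, 4]
```

Even on a perfectly stationary stretch, every second frame is evaluated, which is the limitation noted above.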

The Applicant has tackled the problem of optimising the execution time of a neural network in a speech recognition system, maintaining high accuracy in the recognition process. To this purpose a method of speeding the execution of a neural network, which allows a variable number of frames to be skipped depending on the characteristics of the input signal, is disclosed.

The Applicant observes that the accuracy of a recognition process can be maintained at high levels, even if more than one consecutive input frame is skipped in those regions where the signal is supposed to be stationary, provided that the distance between non-consecutive frames is measured with sufficient precision.

The Applicant has determined that, if the measurement of such distance is based on the probability distributions, or likelihoods, of the phonetic units computed by the neural network, such measurement can be particularly precise.

In view of the above, it is an object of the invention to provide a method of optimising the execution of a neural network in a speech recognition system which allows a variable number of frames of an input speech signal to be conditionally skipped.

SUMMARY OF THE INVENTION

According to the invention that object is achieved by means of a method of optimising the execution of a neural network in a speech recognition system, by conditionally skipping a variable number of frames, depending on a distance computed between output probabilities, or likelihoods, of the neural network. The distance is initially evaluated between two frames at times t and t+k, where k is a predetermined maximum distance between frames, and if such distance is sufficiently small, the frames comprised between times t and t+k are calculated by interpolation, avoiding further executions of the neural network. If, on the contrary, such distance is not small enough, it means that the outputs of the network are changing quickly, and it is not possible to skip too many frames. In that case the method attempts to skip fewer frames (for example k/2 frames), calculating and evaluating a new distance.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will now be described, by way of example only, with reference to the annexed figures of drawing, wherein:

FIG. 1 shows schematically a sequence of frames of an input speech signal;

FIG. 2 is a diagram showing a threshold segmented function used by a method according to the present invention; and

FIG. 3 is a flow diagram showing an example of implementation of a method according to the present invention.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION

With reference to FIG. 1, a plurality of frames of a digitised input speech signal are schematically represented on a time axis T. Each frame is a small segment of speech, for example having a size of 10 ms, containing an equal number of waveform samples, for example 80 samples assuming a sampling rate of 8 kHz.
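By way of illustration only, the framing arithmetic given above (8 kHz sampling, 10 ms frames, 80 samples per frame) can be sketched in Python. The function name split_into_frames and its parameters are hypothetical, and dropping a trailing partial frame is an assumption not stated in the description.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=10):
    """Split a sequence of waveform samples into equal-sized frames."""
    frame_len = sample_rate_hz * frame_ms // 1000   # 80 samples at 8 kHz and 10 ms
    n_frames = len(samples) // frame_len            # assumption: drop a trailing partial frame
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of (silent) telephone speech gives 100 frames of 80 samples.
one_second = [0] * 8000
frames = split_into_frames(one_second)
print(len(frames), len(frames[0]))  # 100 80
```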

The method according to the invention measures a distance between two non-consecutive frames, for example frames 4 and 6 in FIG. 1, corresponding to time slots t and t+k on time axis T, for evaluating the possibility of skipping the run of the neural network for one or more frames comprised between frames 4 and 6.

In order to measure such distance the method computes, for each of frames 4 and 6, a corresponding feature vector which is passed as an input parameter to the neural network. The output of the neural network is a probability, or likelihood, for each phonetic category, that the current frame belongs to that category. The distance, as explained in detail hereinbelow, is computed between output parameters, or likelihoods, of the neural network.

If the distance measured between frames t and t+k is small enough, i.e. lower than a predetermined threshold, it is presumed that the input speech signal is in a stationary phase and the output parameters of the neural network are not changing. In such case the method decides that it is not necessary to calculate exactly, by means of the computation of features and the run of the neural network, the output parameters, or likelihoods, corresponding to the intermediate frames between t and t+k. The likelihoods are therefore calculated by interpolation, for example a linear interpolation, between the likelihoods corresponding to frames t and t+k.
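As a minimal sketch of the linear interpolation mentioned above, the following Python fragment fills the intermediate frames between two likelihood vectors. The function name interpolate_likelihoods and the sample values are hypothetical; the likelihoods are assumed to be plain lists of probabilities, one per phonetic category.

```python
def interpolate_likelihoods(p_start, p_end, n_between):
    """Return n_between likelihood vectors linearly spaced between p_start and p_end."""
    span = n_between + 1          # number of frame steps from t to t+k
    result = []
    for j in range(1, span):
        w = j / span              # weight of the later frame
        result.append([(1 - w) * a + w * b for a, b in zip(p_start, p_end)])
    return result


# Hypothetical likelihoods at frames t and t+4, with three frames in between.
p_t = [0.8, 0.2]
p_tk = [0.4, 0.6]
middle = interpolate_likelihoods(p_t, p_tk, n_between=3)
for vec in middle:
    print([round(x, 3) for x in vec])
```

Since each interpolated vector is a convex combination of two probability distributions, its components still sum to one.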

If, on the contrary, the distance is not small enough, i.e. equal to or greater than a predetermined threshold, it is presumed that the output parameters of the neural network are in an unsteady phase, and it is not possible to skip the run of the neural network for all the intermediate frames between t and t+k. In such case the method tries to skip a reduced number of frames, applying recursively the above mentioned procedure on sub-intervals of frames comprised within t and t+k. The method ends when all likelihoods in the main interval t, t+k have been calculated, by interpolation or by running the neural network. In the worst case all the likelihoods are calculated exactly by means of the neural network.

As the output parameters of the neural network, or likelihoods, can be interpreted as probability distributions over acoustic units, the distance is calculated as a distance between probability distributions, according to the Kullback symmetric distance formula:

$KLD(P_1, P_2) = \int \left[ p_1(y) - p_2(y) \right] \ln \frac{p_1(y)}{p_2(y)} \, dy$

where P₁ and P₂ are the probability distributions, or likelihoods.

The KLD function assumes a non-negative value, which approaches zero as the two probability distributions P₁ and P₂ become more similar and is zero when they are identical.
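For the discrete outputs of a neural network the integral above becomes a sum over the output units. A minimal Python sketch of this discrete form is given below; the function name symmetric_kl is hypothetical, and the small floor eps, used to avoid taking the logarithm of zero, is an assumption that is not part of the description.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """Discrete symmetric Kullback distance: sum_i (p_i - q_i) * ln(p_i / q_i)."""
    total = 0.0
    for pi, qi in zip(p, q):
        pi = max(pi, eps)   # assumption: floor the values to avoid log(0) and division by zero
        qi = max(qi, eps)
        total += (pi - qi) * math.log(pi / qi)
    return total


p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.2, 0.7]
print(symmetric_kl(p1, p1))   # 0.0 for identical distributions
print(symmetric_kl(p1, p2))   # about 2.335, i.e. 1.2 * ln(7)
```

Each term of the sum is non-negative, because the difference and the logarithm of the ratio always have the same sign, and swapping the two arguments leaves the result unchanged.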

The method can be implemented by means of an algorithm which makes use of a lookahead buffer for storing the whole interval of frames comprised between t and t+k.

In the following the term “neural motor” is intended as the combination of the algorithm implementing the method, a feature extraction module and the neural network.

The method comprises an initialisation phase in which the neural motor stores a number of frames, received from the front-end, equal to the length of the lookahead buffer, without outputting any lookahead value to the matching module. After this initialisation phase the neural motor becomes synchronous with the front-end and matching module, until reaching a final phase in which the buffer is emptied. The filling and emptying operations of the buffer take place alternately, i.e. the buffer is not refilled until it has been completely emptied, and the likelihoods are calculated in a burst mode, only when the buffer is completely full.

The method operates according to the following main steps:

a) buffering a plurality N (where N=k+1) of input frames;

b) defining an interval corresponding initially to a main interval of frames delimited by a first 4 and a second 6 non-consecutive buffered frames;

c) calculating, by means of the neural network, a first and a second likelihood corresponding to the frames delimiting the interval;

d) calculating a symmetric Kullback distance between the first and the second likelihoods;

e) comparing the Kullback distance with a predetermined threshold value S and, in case the distance is lower than the threshold value S, calculating by interpolation between the first and the second likelihoods, the likelihood or likelihoods corresponding to the frame or frames comprised within the interval, or, in case the distance is greater than the threshold value S, calculating, by means of the neural network, at least one likelihood corresponding to a frame comprised within the interval;

f) applying recursively said steps c) to e) to each interval present as a sub-interval within said main interval, containing at least one frame whose likelihood has not been yet calculated, until all the likelihoods corresponding to the frames in the main interval have been calculated.

The interpolation operation used in step e) can be, for example, a linear interpolation.

The main interval of frames coincides preferably with the totality N of the buffered input frames.

The accuracy of the method is influenced mainly by two parameters, the length N of the lookahead buffer, which determines the maximum number of skipped frames, and the value of the threshold S on the Kullback distance.

As regards the first parameter N, an optimal value has been found to be N=7 (max number of skipped frames=5).

The threshold value S directly influences the probability that a greater number of frames are skipped, and its choice can be determined, for example, considering the dimension of the vocabulary of words to be recognised.

Assuming that the method is used for optimising the run of a neural network in a speech recognition system already having optimal performance for managing large vocabularies, it has been found that an optimal solution is to use, as threshold value S, a fuzzy set as shown, for example, in FIG. 2. The said fuzzy set S, having a domain V corresponding to the percentage of output units of the neural network used by the current phonetic variability, is a linear segmented decreasing function. It assumes a maximum value of 4.0 in a first segment 10 and a minimum value of 1.0 in a last segment 14, linearly decreasing from 4.0 to 1.0 in the middle segment 12.

If the phonetic variability V is lower than 15% the threshold is set to 4.0, while it is set to 1.0 when the phonetic variability V is comprised between 80% and 100%.
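The segmented threshold function described above can be sketched in Python as follows. The function name fuzzy_threshold is hypothetical, and it is assumed, from the two breakpoints given in the text, that the middle segment 12 runs exactly from V=15% to V=80%.

```python
def fuzzy_threshold(v_percent, low=15.0, high=80.0, s_max=4.0, s_min=1.0):
    """Threshold S as a piecewise-linear decreasing function of the variability V (in %)."""
    if v_percent < low:
        return s_max                          # first segment: constant 4.0
    if v_percent >= high:
        return s_min                          # last segment: constant 1.0
    # middle segment (assumed to span low..high): linear decrease from s_max to s_min
    frac = (v_percent - low) / (high - low)
    return s_max - frac * (s_max - s_min)


for v in (10.0, 47.5, 90.0):
    print(v, fuzzy_threshold(v))   # 4.0, then 2.5, then 1.0
```

A higher threshold at low variability makes skipping more likely when few output units are active, which is the behaviour the description attributes to the fuzzy set.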

A possible implementation, independent from the buffer length, of the recursive algorithm used for calculating the likelihoods corresponding to the frames buffered in the lookahead buffer, will now be illustrated.

The function invoked is:

    Do_run_skip( )
    {
        <load lookahead with extremes s and e>;
        run(s);
        run(e);
        Run_skip(s,e);
    }

where s and e are the extremes of the lookahead buffer, and the function “Run_skip(s,e)” is defined as follows:

    Run_skip(h,k)
    {
        If ((k-h) == 1) return;       /* no intermediate frame left */
        D = kld(h,k);
        If (D < sieve) Then
            interpolate(h,k);
        Else
        {
            C = (int)((h+k)/2);       /* middle frame of the interval */
            run(C);
            Run_skip(h,C);
            Run_skip(C,k);
        }
    }

where “sieve” is the threshold value S and the auxiliary functions “run(k)”, “interpolate(h,k)” and “kld(h,k)” are defined as:

    run(k): executes the run of the Neural Network on frame k;

    kld(h,k)
    {
        dist = 0;
        for(i=0; i<noutputs; i++)
        {
            dist += (output_(h)[i] - output_(k)[i]) * log(output_(h)[i] / output_(k)[i]);
        }
        return(dist);
    }

    interpolate(h,k)
    {
        for(i=0; i<noutputs; i++)
        {
            /* per-frame increment going from frame h towards frame k */
            delta = (output_(k)[i] - output_(h)[i]) / (k-h);
            /* fill every intermediate frame h+1 ... k-1 */
            for(j=1; j<(k-h); j++)
            {
                output_(h+j)[i] = output_(h)[i] + delta*j;
            }
        }
    }

where “output_(h)[i]” indicates the output of the unit i of the neural network at frame h, and noutputs is the number of the output units of the neural network.
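The recursive procedure can also be written as a short runnable Python sketch, given here for illustration only. The neural network is replaced by a hypothetical stand-in, make_counting_network, which returns prepared distributions and records which frames were requested; that helper, the sample data and the eps floor inside kld are assumptions that do not come from the description.

```python
import math


def kld(p, q, eps=1e-12):
    """Symmetric Kullback distance between two discrete distributions."""
    return sum((max(a, eps) - max(b, eps)) * math.log(max(a, eps) / max(b, eps))
               for a, b in zip(p, q))


def interpolate(outputs, h, k):
    """Fill frames h+1 .. k-1 by linear interpolation between frames h and k."""
    for j in range(1, k - h):
        w = j / (k - h)
        outputs[h + j] = [(1 - w) * a + w * b for a, b in zip(outputs[h], outputs[k])]


def run_skip(outputs, run, h, k, sieve):
    """Recursively fill the likelihoods of the frames strictly between h and k."""
    if k - h <= 1:
        return                              # no intermediate frame left
    if kld(outputs[h], outputs[k]) < sieve:
        interpolate(outputs, h, k)          # steady interval: skip the network
    else:
        c = (h + k) // 2                    # middle frame of the interval
        outputs[c] = run(c)
        run_skip(outputs, run, h, c, sieve)
        run_skip(outputs, run, c, k, sieve)


def do_run_skip(n_frames, run, sieve):
    """Compute the likelihoods of all n_frames buffered frames."""
    outputs = [None] * n_frames
    s, e = 0, n_frames - 1                  # extremes of the lookahead buffer
    outputs[s] = run(s)
    outputs[e] = run(e)
    run_skip(outputs, run, s, e, sieve)
    return outputs


def make_counting_network(distributions):
    """Hypothetical stand-in for the neural network that records each run."""
    calls = []

    def run(frame):
        calls.append(frame)
        return list(distributions[frame])

    return run, calls


N = 7
steady = [[0.5, 0.5]] * N                   # outputs never change
changing = [[0.94 if i == f else 0.01 for i in range(N)] for f in range(N)]

run_a, calls_a = make_counting_network(steady)
out_a = do_run_skip(N, run_a, sieve=1.0)
run_b, calls_b = make_counting_network(changing)
out_b = do_run_skip(N, run_b, sieve=1.0)
print(len(calls_a), len(calls_b))           # 2 7
```

With a buffer of N=7 frames the steady case runs the stand-in network only on the two extremes and interpolates the other five frames, while the rapidly changing case falls back to running it on every frame.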

Operatively, the neural motor implementing the method operates as follows:

1. Initialisation. The neural motor, which incorporates the lookahead buffer, captures N frames in order to fill the buffer, without returning any parameter;

2. Run phase. The run phase can be different, depending on whether the lookahead buffer is in the filling phase or already filled:

2.1 When the buffer is not full the system is in a synchronous phase of lookahead buffer filling and simultaneous releasing of likelihoods already calculated in a previous calculation phase. At every step of this phase a single frame is acquired from the front-end and buffered, releasing a pre-calculated likelihood. When the buffer is again full the calculation of the likelihoods can start, according to point 2.2.

2.2 When the buffer is full the system calculates, according to the method previously described, the likelihoods corresponding to all the frames present in the lookahead buffer.

3. Final phase. At the end, when the input frames sequence ends, all the likelihoods already calculated are sequentially released.
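The three phases above can be sketched as a Python generator, offered only as one possible reading of the description. The names neural_motor and compute_block are hypothetical; compute_block stands for the burst calculation of point 2.2. It is further assumed, since the text does not say so, that a partially filled buffer left over at the end of the input is processed in the same way before the final release.

```python
def neural_motor(frames, n, compute_block):
    """Yield one likelihood per input frame, delayed by the lookahead buffer."""
    buffer = []   # frames waiting for the next burst calculation
    ready = []    # likelihoods from the previous burst, not yet released
    for frame in frames:
        buffer.append(frame)                     # acquire one frame from the front-end
        if ready:
            yield ready.pop(0)                   # 2.1: release one pre-calculated likelihood
        if len(buffer) == n:
            ready.extend(compute_block(buffer))  # 2.2: burst calculation on a full buffer
            buffer = []
    if buffer:
        ready.extend(compute_block(buffer))      # assumption: flush a partial buffer
    yield from ready                             # 3: final phase, release what is left


block_sizes = []


def compute_block(block):
    """Hypothetical stand-in for the skip algorithm applied to one buffer."""
    block_sizes.append(len(block))
    return [f * 10 for f in block]


out = list(neural_motor(range(10), n=4, compute_block=compute_block))
print(out)
print(block_sizes)  # [4, 4, 2]
```

In this sketch nothing is released during the first n frames, and afterwards the release of old likelihoods and the refilling of the buffer proceed one frame at a time, so the buffer becomes full exactly when the previous burst has been completely released.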

A speech recognition system implementing the above illustrated method comprises the following elements:

a lookahead buffer for storing a plurality N of input frames;

a distance evaluation unit for calculating the distance between likelihoods;

a comparing unit for comparing the distance with the threshold value S;

an interpolation unit for calculating one or more likelihoods comprised between two already calculated likelihoods.

In order to understand the operation of the method according to the invention, an example of application will now be described in detail, with reference to FIG. 3. The flow diagram shown in FIG. 3 illustrates how the method is applied in a case in which the length of the lookahead buffer is equal to 5 (max frames skipped=3).

Lookahead length = 5

    <likelihoods P₁ and P₅ are calculated by means of the run of the neural network> (block 20)
    if <the distance between P₁ and P₅ is lower than threshold S> (block 22)
    then
        <likelihoods P₂, P₃ and P₄ are calculated by linear interpolation between likelihoods P₁ and P₅> (block 24)
    else
        <likelihood P₃ is calculated by means of the run of the neural network> (block 26)
        if <the distance between P₁ and P₃ is lower than threshold S> (block 28)
        then
            <likelihood P₂ is calculated by linear interpolation between likelihoods P₁ and P₃> (block 30)
        else
            <likelihood P₂ is calculated by means of the run of the neural network> (block 32)
        if <the distance between P₃ and P₅ is lower than threshold S> (block 34)
        then
            <likelihood P₄ is calculated by linear interpolation between likelihoods P₃ and P₅> (block 36)
        else
            <likelihood P₄ is calculated by means of the run of the neural network> (block 38)
    return <likelihoods P₁...P₅> (block 40)

The method and system according to the present invention can be implemented as a computer program comprising computer program code means adapted to run on a computer. Such computer program can be embodied on a computer readable medium.

Some advantages of the method are the following:

the skip of frames is performed only after a precise evaluation of distances between output likelihoods, avoiding errors due to hazardous skipping;

if, due for example to mismatch or noise, the output likelihoods show an increased instability, the method automatically reduces the skip rate;

the method of optimisation is complementary to other optimisation methods, for example to the method disclosed in previously cited document EP 0 733 982, in the name of the same Applicant.

1-13. (canceled)
14. A method of executing a neural network in a speech recognition system for recognizing speech of an input speech signal organized into a series of frames, comprising: evaluating a distance between non-consecutive frames and selectively skipping the run of the neural network in correspondence of at least one frame between said non-consecutive frames; and calculating said distance as a distance between output likelihoods of said neural network.
15. The method according to claim 14, comprising the steps of: a) buffering a plurality of input frames; b) defining an interval corresponding initially to a main interval of frames delimited by a first and a second non-consecutive buffered frames; c) calculating, by means of said neural network, a first and a second likelihood corresponding to the frames delimiting said interval; d) calculating a distance between said first and second likelihoods; e) comparing said distance with a predetermined threshold value and, in case said distance is lower than said threshold value, calculating by interpolation between said first and second likelihoods, the likelihood or likelihoods corresponding to the frame or frames within said interval, or, in case said distance is greater than said threshold value, calculating, by means of said neural network, at least one likelihood corresponding to a frame within said interval; and f) applying recursively said steps c) to e) to each interval present as a sub-interval within said main interval containing at least one frame whose likelihood has not been yet calculated, until all the likelihoods corresponding to the frames in said main interval have been calculated.
16. The method as claimed in claim 15, wherein said interpolation is a linear interpolation.
17. The method as claimed in claim 15, wherein said main interval of frames comprises said plurality of buffered input frames.
18. The method as claimed in claim 15, wherein said likelihoods are probability distributions.
19. The method as claimed in claim 18, wherein said distance between said first and second likelihoods is calculated as a symmetric Kullback distance between probability distributions.
20. The method as claimed in claim 15, wherein said threshold value is a fuzzy set.
21. The method as claimed in claim 20, wherein said fuzzy set has a domain corresponding to the percentage of output units of said neural network used by the current phonetic variability.
22. The method as claimed in claim 21, wherein said fuzzy set is a linear segmented decreasing function.
23. A computer program comprising computer program code means adapted to perform all the steps of any one of claims 14 to 22, when said program is capable of being run on a computer.
24. The computer program as claimed in claim 23, embodied on a computer readable medium.
25. A speech recognition system for recognizing speech of an input speech signal, according to the method of any one of claims 14 to 22, comprising: a neural network for calculating likelihoods corresponding to frames of said input speech signal, comprising: a buffer for storing a plurality of input frames; a distance evaluation unit for calculating a distance between a first and a second likelihood, said first and second likelihoods being obtained by means of said neural network and corresponding to a first and a second non-consecutive buffered frames; a comparing unit for comparing said distance with a predetermined threshold value; and an interpolation unit for calculating, in case said distance is lower than said threshold value, the likelihood or likelihoods corresponding to the frame or frames between said first and second non-consecutive buffered frames.
26. The speech recognition system according to claim 25, wherein said buffer is a lookahead buffer.